2018-09-01

Why do we love R so much for data-analysis?

  • R is very interactive, Q&A with your data

  • R has fantastic functionalities for plotting

  • R is super rich in statistical models

  • We can program in R

  • We don't have to use R when using R

We can do

library(dplyr)
mtcars %>% mutate(cyl_drat = cyl + drat)
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb cyl_drat
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4     9.90
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4     9.90
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1     7.85
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1     9.08
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2    11.15
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1     8.76
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4    11.21
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2     7.69
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2     7.92
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4     9.92
## 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4     9.92
## 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3    11.07
## 13 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3    11.07
## 14 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3    11.07
## 15 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4    10.93
## 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4    11.00
## 17 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4    11.23
## 18 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1     8.08
## 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2     8.93
## 20 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1     8.22
## 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1     7.70
## 22 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2    10.76
## 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2    11.15
## 24 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4    11.73
## 25 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2    11.08
## 26 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1     8.08
## 27 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2     8.43
## 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2     7.77
## 29 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4    12.22
## 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6     9.62
## 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8    11.54
## 32 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2     8.11

or

mtcars_dt <- data.table::as.data.table(mtcars)
mtcars_dt[, cyl_drat := cyl + drat]

Instead of

mtcars$cyl_drat <- mtcars$cyl + mtcars$drat

We all use NSE!

When you started using R, did you mix these up?

install.packages("padr")

and

library(padr)

Or did you wonder why library(padr) worked, even though there is no variable called padr?

We all use NSE!

Apparently, things that ought not to work are working.

This results in a language full of magic (also in base):

subset(mtcars, cyl == 6)

ggplot2::ggplot(mtcars, aes(mpg, drat)) +
  geom_point()

data.table::as.data.table(mtcars)[ ,mean(mpg), by = cyl]

Why data analysts love it and CS people don't

R is designed to do data science. (Well, back then it was still called statistics.)

Flexibility to maximize insights.

It enables DSL creation, so tools can be tailor-made to solve a specific problem without overhead.

With flexibility comes ambiguity and responsibility.

What is this talk about?

What is standard in the first place?

my_val <- 123

my_func <- function(x) {
  x / 42 * 121
}

my_func(71)
## [1] 204.5476
my_func(my_val)
## [1] 354.3571
my_func(your_val)
## Error in my_func(your_val): object 'your_val' not found

What's in a NAME

By creating a variable we assign a value to a name.

my_val <- 123

123 is the value that is bound to the name my_val.

Binding happens in an environment, in this case the global environment.

Just call my name, honey, I'll give you the value:

my_val
## [1] 123
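A minimal sketch of the same idea with an explicit environment. The names `e` and `other` are hypothetical and introduced only for illustration. It shows that a binding lives in one particular environment, and that a lookup restricted to another environment does not find it.

```r
# Create a fresh environment and bind the value 123 to the name my_val in it.
e <- new.env()
assign("my_val", 123, envir = e)

# Looking the name up in that environment returns the bound value.
value_in_e <- get("my_val", envir = e)

# A different, empty environment has no such binding.
# inherits = FALSE stops R from searching the parent environments.
other <- new.env()
found_elsewhere <- exists("my_val", envir = other, inherits = FALSE)
```

Here `value_in_e` holds 123 and `found_elsewhere` is FALSE.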

Lexical scoping

R starts looking for the value of a name in the local environment.

x <- "a variable in the global"
a_func <- function() {
  x <- "a variable in the local"
  x
}
a_func()
## [1] "a variable in the local"

Lexical scoping

When R can't find it locally, it moves up to the parent environment (where the function was created).

z <- "a variable in the global"
another_func <- function() {
  z
}
another_func()
## [1] "a variable in the global"
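The phrase "where the function was created" can be made concrete with a small sketch. The names `make_func`, `created_func` and `label` are hypothetical. The inner function is created inside `make_func`, so that is where R looks first once the local lookup fails, even though the function is called from the global environment.

```r
label <- "a variable in the global"

make_func <- function() {
  label <- "a variable where the function was created"
  # The returned function has no local label,
  # so R looks in the environment of make_func().
  function() label
}

created_func <- make_func()
result <- created_func()
```

Here `result` is "a variable where the function was created", not the global value.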

Lexical scoping

Finally, an error is thrown when the variable can't be found.

nobody_loves_me <- function() {
  y
}
nobody_loves_me()
## Error in nobody_loves_me(): object 'y' not found

So this is standard evaluation in R.

Wait for it

When evaluating a name, R looks for the value bound to it and throws an error when it can't find one.

We can also ask R to postpone judgement by storing the request in a name object.

quote(my_unknown_var) %>% class()
## [1] "name"

This is the act of quoting, saving something to be evaluated later.

Wait for it

Quoted variable names are not evaluated. It doesn't matter if they don't exist.

quoted_var <- quote(wait_for_it)
quoted_var
## wait_for_it

Wait for it

It will start looking for the value only when we ask to evaluate it.

eval(quoted_var)
## Error in eval(quoted_var): object 'wait_for_it' not found

Wait for it

wait_for_it <- "I finally have a value"
eval(quoted_var)
## [1] "I finally have a value"

Building our own select

diy_select <- function(x, name) {
  eval(name, envir = x)
}

diy_select(mtcars, quote(cyl))
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
diy_select(mtcars, quote(vs))
##  [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1
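As a sketch of why this works: `eval` accepts a data frame as `envir`, and the columns then act as the variables. The example below quotes a small call instead of a single name and evaluates it against `mtcars`; the variable name `cyl_plus_drat` is only illustrative.

```r
# Evaluate the quoted call with the columns of mtcars as variables.
cyl_plus_drat <- eval(quote(cyl + drat), envir = mtcars)

# The same values as the standard-evaluation version.
same_as_standard <- all.equal(cyl_plus_drat, mtcars$cyl + mtcars$drat)

head(cyl_plus_drat, 3)
## [1] 9.90 9.90 7.85
```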

Not just names

We can quote the following things:

  • name: the name of an R object

  • call: calling of a function

  • pairlist: something from the past you shouldn't bother about

  • literal: evaluates to the value itself
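A quick way to see these types is to ask for the class of a few quoted inputs. This is a small sketch; the variable names are hypothetical, and `formals()` is used here only as one convenient way to obtain a pairlist.

```r
class_of_name    <- class(quote(my_name))         # "name"
class_of_call    <- class(quote(mean(my_name)))   # "call"
class_of_number  <- class(quote(42))              # "numeric": a literal is just its value
class_of_string  <- class(quote("text"))          # "character"

# The formal arguments of a function are stored as a pairlist.
class_of_formals <- class(formals(function(a, b) NULL))  # "pairlist"
```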

Expressions: "don't be another SQL"

Call

Just like a name, a function call can be delayed by quoting it.

my_little_filter <- function(x, 
                             call) {
  x[eval(call, envir = x), ]
}

my_little_filter(mtcars, quote(cyl == 4)) %>% head(2)
##             mpg cyl  disp hp drat   wt  qsec vs am gear carb cyl_drat
## Datsun 710 22.8   4 108.0 93 3.85 2.32 18.61  1  1    4    1     7.85
## Merc 240D  24.4   4 146.7 62 3.69 3.19 20.00  1  0    4    2     7.69

Quoting inside the function

You'll never have to quote your function arguments when using a DSL.

mtcars %>% select(cyl)
as.data.table(mtcars)[, cyl]
ggplot(mtcars, aes(cyl)) + geom_bar()

Why does R not throw an error? There is no cyl in the global environment…

Lazy, lazy R

koala <- function(x, y) {
  x + 42
}

koala(3)
## [1] 45
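The flip side, as a sketch: the missing argument only becomes a problem once the function actually needs it. `koala_needs_y` is a hypothetical variant that uses `force()` to evaluate `y` explicitly.

```r
koala_needs_y <- function(x, y) {
  force(y)  # explicitly evaluate y; this fails when y was not supplied
  x + 42
}

# With both arguments supplied, the function behaves like koala().
with_y <- koala_needs_y(3, 1)

# Without y, the error appears at the moment y is evaluated.
errored <- tryCatch({
  koala_needs_y(3)
  FALSE
}, error = function(e) TRUE)
```

Here `with_y` is 45 and `errored` is TRUE.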

Industrious Python

def koala(x, y):
  return(x + 42)
koala(3)
## TypeError: koala() takes exactly 2 arguments (1 given)
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>

Quoting inside a function

So, R doesn't make a fuss until it really has to.

This allows quoting inside functions.

my_second_little_filter <- function(x, bare_call) {
  call <- quote(bare_call)
  x[eval(call, envir = x), ]
}

my_second_little_filter(mtcars, cyl == 4) %>% head(2)
## Error in eval(call, envir = x): object 'cyl' not found

Why isn't this working?

Quoting inside a function

quote quotes its input literally, so here it captures the name bare_call. We want to capture the expression that was passed as the argument, not the argument's name.

Here we need substitute:

substitute_example <- function(x) {
  substitute(x)
}
substitute_example(cyl == 4)
## cyl == 4
substitute_example(cyl == 4) %>% class()
## [1] "call"
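Placing the two side by side makes the difference visible. In this sketch, `quote_version` and `substitute_version` are hypothetical names for two otherwise identical functions.

```r
quote_version <- function(x) quote(x)
substitute_version <- function(x) substitute(x)

# quote() returns exactly what is written inside it: the name x.
quoted <- quote_version(cyl == 4)

# substitute() returns the expression the caller supplied for x.
substituted <- substitute_version(cyl == 4)
```

Here `quoted` is the name `x`, while `substituted` is the call `cyl == 4`.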

That's a promise

So every function argument is stored unevaluated in a promise: the expression is kept, alongside its value once it has been computed.

substitute retrieves the expression.
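A small sketch to show that retrieving the expression does not evaluate it. The function `peek` is hypothetical; the argument would throw an error if it were ever evaluated.

```r
peek <- function(x) {
  # Only the expression is read from the promise; x itself is never evaluated.
  deparse(substitute(x))
}

peeked <- peek(stop("never evaluated"))
```

No error is raised, and `peeked` holds the text of the call that was passed in.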

All together

my_correct_second_little_filter <- function(x, bare_call) {
  call <- substitute(bare_call)
  x[eval(call, envir = x), ]
}

my_correct_second_little_filter(mtcars, cyl == 4) %>% head(1)
##             mpg cyl disp hp drat   wt  qsec vs am gear carb cyl_drat
## Datsun 710 22.8   4  108 93 3.85 2.32 18.61  1  1    4    1     7.85
  • The call cyl == 4 on its own is invalid: there is no cyl variable in the global environment.
  • But R refrains from judgement and stores it in a promise.
  • substitute retrieves just the expression, which is the quoted call.
  • This expression is evaluated within the environment of x.
  • Here it is completely valid, because x has a cyl variable.

Quoting strings

func1 <- function() "Calling function 1"
func2 <- function() "Calling function 2"

func_caller <- function(nr) { 
  eval(parse(text = paste0("func", nr)))()
}

func_caller(1)
## [1] "Calling function 1"
func_caller(2)
## [1] "Calling function 2"
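To see what happens to the string, the sketch below inspects the result of `parse`. It also shows `get()`, a different base R function that looks up an object by a name given as a string; this is an alternative to the approach above, not the same technique. The name `func_caller_get` is hypothetical.

```r
func1 <- function() "Calling function 1"
func2 <- function() "Calling function 2"

# parse() turns the string into an expression vector holding one name.
parsed <- parse(text = "func1")
parsed_class  <- class(parsed)       # "expression"
element_class <- class(parsed[[1]])  # "name"

# Alternative: let get() find the function by its name, then call it.
func_caller_get <- function(nr) {
  get(paste0("func", nr), mode = "function")()
}

result_1 <- func_caller_get(1)
result_2 <- func_caller_get(2)
```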

An actually useful example

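get_source_data <- function(nr,
                            rerun = FALSE) {
  file_path <- paste0("data/source_data_", nr, ".Rdata")
  if (file.exists(file_path) && !rerun) {
    # A cached copy exists: load source_data_<nr> into this function's environment.
    load(file_path)
  } else {
    # Evaluate the name query_<nr> and bind the result to source_data_<nr>.
    assign(paste0("source_data_", nr),
           parse(text = paste0("query_", nr)) %>% eval())
    # Cache the result on disk for the next call.
    save(list = paste0("source_data_", nr), file = file_path)
  }
  # Return the value bound to source_data_<nr>.
  parse(text = paste0("source_data_", nr)) %>% eval()
}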

Tidyeval

The tidyverse is implemented using NSE.

mtcars %>% select(cyl)

We now know that cyl gets somehow quoted by select and evaluated within the data frame.

But what if we want to wrap tidyverse code in a custom function?

Tidyeval - custom function

This won't work

my_grouping_func <- function(x, grouping_var) {
  x %>% 
    group_by(grouping_var) %>% 
    summarise(max_drat = max(drat))
}
my_grouping_func(mtcars, cyl)

Why?

Tidyeval - custom function

In order to get it to work:

  • quote the variable upfront
  • unquote it again before the argument is swallowed by the tidyverse function
  • then the tidyverse function can go back and quote it again
my_grouping_func <- function(x, grouping_var) {
  x %>% 
    group_by(!!grouping_var) %>% 
    summarise(max_drat = max(drat))
}
my_grouping_func(mtcars, quo(cyl))
## # A tibble: 3 x 2
##     cyl max_drat
##   <dbl>    <dbl>
## 1     4     4.93
## 2     6     3.92
## 3     8     4.22

Tidyeval - custom function

Just as with substitute, you can quote the expression supplied for an argument with enquo.

my_grouping_func <- function(x, grouping_var) {
  grouping_var_q <- enquo(grouping_var)
  x %>% 
    group_by(!!grouping_var_q) %>% 
    summarise(max_drat = max(drat))
}
my_grouping_func(mtcars, cyl)
## # A tibble: 3 x 2
##     cyl max_drat
##   <dbl>    <dbl>
## 1     4     4.93
## 2     6     3.92
## 3     8     4.22
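The same pattern extends to more than one argument. This is a sketch, assuming dplyr is installed; the function `my_max_by` and the column name `max_value` are hypothetical. Each bare argument is quoted with `enquo` and unquoted with `!!` at the place where the tidyverse function expects it.

```r
library(dplyr)

my_max_by <- function(x, grouping_var, value_var) {
  # Quote both bare arguments upfront.
  grouping_var_q <- enquo(grouping_var)
  value_var_q <- enquo(value_var)
  x %>%
    group_by(!!grouping_var_q) %>%               # unquote inside group_by
    summarise(max_value = max(!!value_var_q))    # unquote inside summarise
}

res <- my_max_by(mtcars, cyl, drat)
```

With `cyl` and `drat`, the `max_value` column holds the same three values as `max_drat` in the output above.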